Building a corpus of spoken Dutch

نویسنده

  • Nelleke Oostdijk
چکیده

In this paper the Spoken Dutch Corpus Project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overview of the project. It then goes on to describe the data that are available in the first release of the first part of the corpus that came out March 1st, 2000.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reduction of Dutch Sentences for Automatic Subtitling

We compare machine learning approaches for sentence length reduction for automatic generation of subtitles for deaf and hearing-impaired people with a method which relies on hand-crafted deletion rules. We describe building the necessary resources for this task: a parallel corpus of examples of news broadcasts of the Flemish VRT broadcasting corporation, and a Dutch shallow parser based on the ...

متن کامل

CGN, an annotated corpus of spoken Dutch

Although there are two variants of Dutch, the northern variant being the one used in the Netherlands and the southern variant in Flanders (Belgium), one corpus of spoken Dutch is under construction, the Spoken Dutch Corpus (CGN). In this paper first the principles of this corpus will be discussed, thereafter a few small case studies will show what the merits of such a corpus are.

متن کامل

Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch

In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...

متن کامل

Using Large Multi-purpose Corpora for Specific Research Questions: Discourse Phenomena Related to Wh-questions in the Spoken Dutch Corpus

In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs fr...

متن کامل

The Spoken Dutch Corpus. Overview and First Evaluation

In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, i...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999